In this project, we analyze red wine quality. This report explores a dataset containing quality ratings and chemical attributes for approximately 1,600 red wines. There are 11 attribute variables, all of them continuous.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 3: 10
## 1st Qu.: 9.50 4: 53
## Median :10.20 5:681
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The histogram of quality looks roughly normal, with a peak around 5 to 6. From the summary, the mean is 5.636 and the median is 6. The best quality is 8 and the worst is 3.
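As a quick check, the mean and median can be recomputed from the quality counts printed above. This is a minimal sketch: it rebuilds one rating per wine from the counts instead of loading the data file, and the variable names are illustrative only.

```r
# Quality counts copied from the table above (levels 3 to 8).
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Expand the counts back into one rating per wine.
quality <- rep(as.integer(names(counts)), times = counts)

n_wines <- length(quality)
mean_quality <- mean(quality)
median_quality <- median(quality)

# Share of wines rated 5 or 6.
share_5_or_6 <- mean(quality %in% c(5, 6))

print(c(n = n_wines, mean = round(mean_quality, 3), median = median_quality))
print(round(share_5_or_6, 3))
```

The counts sum to 1599 wines and reproduce the reported mean of 5.636 and median of 6.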
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity looks roughly normal after changing the x-axis to a log10 scale. The standard deviation is 1.741 and the median is 7.9. Most wines have a fixed acidity between 6 and 10 (no unit given).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity looks only slightly like a normal distribution on a continuous (linear) scale, and a log10 scale does not give a clean normal shape either. After several attempts, I found that a power scale with exponent 0.2 works best for this variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
I transformed the long-tailed chlorides data to better understand its distribution. The transformed distribution looks roughly normal, with a peak around 0.079, close to the median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density range is quite small, which makes sense. This variable also follows a roughly normal distribution on a continuous (linear) scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH variable looks closest to a normal distribution on a power scale with exponent 0.1. The mean is 3.311, the minimum is 2.740 and the maximum is 4.010.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates is also long-tailed, and it looks much better on a log10 scale. The median is 0.62 and the mean is 0.6581.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The total.sulfur.dioxide variable is clearly best viewed on a log10 scale. The median is 38.00, the mean is 46.47, and the values range from 6 to 289.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar is also best viewed on a log10 scale. The median is 2.200 and the mean is 2.539.
Across the different quality levels, the most common alcohol values lie between 9 and 11.
The histograms of chlorides, sulphates, total.sulfur.dioxide and residual.sugar are right-skewed, so I'm going to transform these variables with a log transform. The histograms of quality, fixed.acidity, volatile.acidity, density and pH look reasonable on a continuous (linear) scale, so these variables do not need a log transform.
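The effect of these transforms on a right-skewed variable can be sketched with synthetic data. The example below does not use the wine dataset: the values are simulated from a log-normal distribution, and `skewness()` is a small helper written here because base R does not provide one.

```r
set.seed(42)  # synthetic data only; not the wine dataset

# Hypothetical right-skewed variable, loosely shaped like a chloride measurement.
x <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.4)

# Sample skewness (third standardized moment).
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / (mean(centered^2)^1.5)
}

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # log10 transform
skew_pow <- skewness(x^0.2)     # power transform with exponent 0.2

print(round(c(raw = skew_raw, log10 = skew_log, power_0.2 = skew_pow), 3))
```

A skewness close to zero suggests a more symmetric shape, which is what the log10 and power scales aim for in the histograms.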
Our data set consists of 13 variables and 1599 observations. Five variables are treated on a continuous (linear) scale: quality, fixed.acidity, volatile.acidity, density and pH. Four variables are treated on a log10 scale: chlorides, sulphates, total.sulfur.dioxide and residual.sugar. The remaining variables do not clearly fit either scale or any particular distribution. Quality ranges from 3 (worst) to 8 (best). Other observations:
* Most wines have a quality of 5 or 6.
* The median quality is 6.
* The maximum quality is 8.
* The best scale for volatile.acidity is a power scale with exponent 0.2.
The main features in the data set are alcohol and quality. I'd like to determine which features are best for predicting the quality of red wine. I suspect alcohol, volatile.acidity and residual.sugar can be used to build a predictive model of wine quality.
Other features that may help are fixed.acidity, volatile.acidity, citric.acid, residual.sugar and chlorides. After reading up on wine quality, I expected residual.sugar to contribute the most to quality.
This dataset does not lend itself well to creating new variables. I tried a few, such as pH/density and alcohol/density, but in the end I only created the variable quality.bucket.
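A bucketed quality variable of this kind can be built with `cut()`. The sketch below is only an illustration: the break points and labels are hypothetical, since the report does not state which ones were used, and the ratings are rebuilt from the quality counts in the summary rather than read from the data file.

```r
# Quality counts copied from the summary output (levels 3 to 8).
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Hypothetical break points giving the intervals (2,4], (4,6] and (6,8].
quality.bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("low", "medium", "high"))

bucket_counts <- table(quality.bucket)
print(bucket_counts)
```

With these assumed breaks, the middle bucket holds the 5 and 6 ratings, which make up most of the dataset.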
I log-transformed the right-skewed chlorides, sulphates, total.sulfur.dioxide and residual.sugar distributions. All of the transformed distributions look roughly normal. I also transformed two variables with a power scale to make them look more normal.
First, we compute the correlation of each variable with quality to select the variables we care about most.
## [,1]
## X 0.06645261
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
## fixed.acidity.ratio.volatile.acidity 0.34346313
## free.sulfur.dioxide.ratio.total.dioxide 0.19411335
From this subset of the data, fixed.acidity, citric.acid, residual.sugar, chlorides, total.sulfur.dioxide, density, pH and free.sulfur.dioxide.ratio.total.dioxide don't seem to have strong correlations with quality. But volatile.acidity, sulphates, alcohol and fixed.acidity.ratio.volatile.acidity are moderately correlated with quality. I want to look more closely at scatter plots of quality against some of these variables, such as alcohol.
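A one-column correlation table like the one above can be produced by passing a set of numeric columns and a numeric quality vector to `cor()`. The sketch below uses a small synthetic data frame (`toy`, a hypothetical stand-in for the wine data), so the numbers it prints are not the report's results.

```r
set.seed(1)  # synthetic stand-in for the wine data

n <- 200
alcohol <- runif(n, 8.4, 14.9)
volatile.acidity <- runif(n, 0.12, 1.58)

# Hypothetical scores built so that alcohol helps and volatile acidity hurts.
score <- 0.3 * alcohol - 1.2 * volatile.acidity + rnorm(n, sd = 0.7)
toy <- data.frame(
  alcohol = alcohol,
  volatile.acidity = volatile.acidity,
  quality = factor(pmin(pmax(round(score + 3), 3), 8))
)

# Convert the factor labels back to numbers before correlating.
quality_num <- as.numeric(as.character(toy$quality))

# cor(x, y) with several columns in x returns a one-column matrix.
predictors <- toy[, c("alcohol", "volatile.acidity")]
cors <- cor(predictors, quality_num)
print(cors)
```

Each row of the result is the Pearson correlation between one predictor and quality, which is how the `[,1]` column in the output above is laid out.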
Alcohol is positively correlated with quality, and the medians of the boxes increase with quality. The correlation coefficient between alcohol and quality is 0.476.
The second positively correlated variable is sulphates, with a correlation coefficient of 0.251. The median also increases with quality.
Here are some of the negative correlations with quality:
The correlation coefficient between density and quality is -0.175. The fitted line slopes downward and the median also decreases.
Volatile acidity is also negatively correlated with quality, with a correlation coefficient of -0.391. In the box plot, the median decreases as quality increases.
## alcohol volatile.acidity sulphates density
## 1 9.4 0.88 0.68 0.9978
##
## Calls:
## m1: lm(formula = as.numeric(levels(wins$quality))[wins$quality] ~
## alcohol + volatile.acidity + I(log(sulphates)) + density,
## data = wins)
##
## =================================
## (Intercept) 3.447
## (10.456)
## alcohol 0.303***
## (0.019)
## volatile.acidity -1.156***
## (0.097)
## I(log(sulphates)) 0.641***
## (0.080)
## density -0.077
## (10.377)
## ---------------------------------
## R-squared 0.3
## adj. R-squared 0.3
## sigma 0.7
## F 210.4
## p 0.0
## Log-likelihood -1587.8
## Deviance 682.1
## AIC 3187.5
## BIC 3219.8
## N 1599
## =================================
## fit lwr upr
## 1 4.95692 3.671375 6.242464
However, the density coefficient is not statistically significant, so I removed it and fit a new model. The results change only slightly.
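The lack of significance can be checked by hand from the m1 table above: the density estimate is tiny compared with its standard error. A minimal sketch of that arithmetic, using the rounded values from the table and assuming the usual t-test for a linear model coefficient:

```r
# Estimate and standard error for density, copied from the m1 table above.
estimate <- -0.077
std_error <- 10.377

# Residual degrees of freedom: 1599 wines minus 5 estimated coefficients.
df_resid <- 1599 - 5

t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t = t_value, p = p_value))
```

A t-value this close to zero gives a p-value near 1, which matches the missing significance stars next to density in the table.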
##
## Calls:
## m2: lm(formula = as.numeric(levels(wins$quality))[wins$quality] ~
## alcohol + volatile.acidity + I(log(sulphates)), data = wins)
##
## ================================
## (Intercept) 3.369***
## (0.184)
## alcohol 0.303***
## (0.016)
## volatile.acidity -1.156***
## (0.097)
## I(log(sulphates)) 0.641***
## (0.077)
## --------------------------------
## R-squared 0.3
## adj. R-squared 0.3
## sigma 0.7
## F 280.6
## p 0.0
## Log-likelihood -1587.8
## Deviance 682.1
## AIC 3185.5
## BIC 3212.4
## N 1599
## ================================
## fit lwr upr
## 1 4.956922 3.671782 6.242063
Quality correlates most strongly with alcohol and volatile.acidity, and less strongly with sulphates and citric.acid. As alcohol increases, quality tends to increase. As volatile.acidity increases, quality tends to decrease. The relationships between quality and the other variables appear linear.
There are strong positive correlations between citric.acid and fixed.acidity, density and fixed.acidity, and total.sulfur.dioxide and free.sulfur.dioxide. There are strong negative correlations between fixed.acidity and pH, citric.acid and pH, and alcohol and density.
The strongest relationship is between fixed.acidity and pH, but it does not help predict the variable we care about most, which is quality. Quality is most strongly (and positively) correlated with alcohol, with a coefficient of 0.476. The second is volatile.acidity, with a coefficient of -0.391. The third and fourth are sulphates (0.251) and citric.acid (0.226).
In the plot above, high-quality wines appear most frequently on the low volatile acidity, high alcohol side.
High-quality wines appear most frequently in the upper-right corner, that is, with high alcohol and high sulphates. Lower-alcohol wines also show a wider range of sulphates.
Residual sugar shows almost no relationship with wine quality, although I had supposed that high-quality wines would have less sugar.
We created a linear model to predict quality from alcohol, volatile.acidity, sulphates and density. For the example wine, newWine, the model predicts a fit of 4.95692 with a 95% interval from 3.671375 to 6.242464. The strength of the model is that it captures the large effect of alcohol on quality. Its limitation is that it does not include variables such as brand or location, which may also affect wine quality.
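The prediction for newWine can be approximately reproduced by hand from the rounded m1 coefficients. This is a sketch of the arithmetic only, so it agrees with the reported fit to about two decimal places. The second half compares the interval's half-width with the residual standard error recovered from the reported deviance. Reading the result as a prediction interval is an inference from these numbers, not something stated in the output.

```r
# Rounded coefficients copied from the m1 table.
coefs <- c(intercept = 3.447, alcohol = 0.303, volatile.acidity = -1.156,
           log_sulphates = 0.641, density = -0.077)

# The example wine (newWine) from the output.
new_wine <- c(alcohol = 9.4, volatile.acidity = 0.88,
              sulphates = 0.68, density = 0.9978)

fit_by_hand <- unname(
  coefs["intercept"] +
    coefs["alcohol"] * new_wine["alcohol"] +
    coefs["volatile.acidity"] * new_wine["volatile.acidity"] +
    coefs["log_sulphates"] * log(new_wine["sulphates"]) +
    coefs["density"] * new_wine["density"]
)

# Residual standard error recovered from the reported deviance and N.
sigma_hat <- sqrt(682.1 / (1599 - 5))

# Half-width of the reported interval, in units of sigma_hat.
half_width <- (6.242464 - 3.671375) / 2
width_ratio <- half_width / sigma_hat
t_crit <- qt(0.975, df = 1599 - 5)

print(c(fit = fit_by_hand, ratio = width_ratio, t_crit = t_crit))
```

The half-width is roughly the t critical value times the residual standard error, which is the width expected for an interval around a single new observation rather than around the mean response.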
Alcohol has a large impact on wine quality. The regression coefficient indicates that a one-unit increase in alcohol is associated with a 0.303 increase in quality. The three lines, from top to bottom, are the 0.9 quantile, the median and the 0.1 quantile. For quality 5, the points are very dense.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality shows a roughly normal distribution. The counts are highest for qualities 5 and 6, which means most of the wines belong to one of these two levels: 681 wines have quality 5 and 638 have quality 6. The first quartile is 5 and the third quartile is 6. The mean quality is 5.636 and the median is 6.
As volatile acidity increases, quality decreases, especially between qualities 4 and 5. From the regression model, a one-unit increase in volatile acidity is associated with a 1.156 decrease in quality. For qualities 5 and 6, most volatile acidity values lie between 0.4 and 0.8 according to the three quantile lines. Above quality 7, the volatile acidity trend flattens out.
The red wine dataset contains information on almost 1,600 wines across 12 variables. I started by understanding 10 of the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on the plots. Eventually, I explored the quality of the wine across many variables and created a linear model to predict wine quality.
There was a clear trend between quality and alcohol, sulphates and volatile.acidity. I was surprised that residual.sugar did not have a strong correlation with quality. For the linear model, all of the wines were included, since information on quality, alcohol, volatile.acidity and sulphates was available for each of them. After transforming sulphates to a log scale, the model was able to account for 30% of the variance in quality.
The main challenge in my analysis was that all of the variables are continuous, so I could not separate them into clear groups. I therefore cut quality into buckets. However, this could be misleading, because readers may assume that quality is a continuous variable, so I also converted quality to a numeric variable where needed. The second challenge was choosing the right plot and the right variables for the multivariate plots, because this dataset is not well suited to them. Eventually, I found that the plots work best with quality as the color and two other variables on the axes.
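Converting a factor such as quality to numbers needs some care in R, which is why the model formulas above use `as.numeric(levels(...))[...]`. A minimal sketch with a made-up five-element factor:

```r
# A small factor of quality ratings (made-up values for illustration).
quality <- factor(c(5, 6, 3, 8, 5))

# as.numeric() on a factor returns the internal level codes, not the ratings.
codes <- as.numeric(quality)

# Indexing the numeric levels by the factor recovers the actual ratings.
ratings <- as.numeric(levels(quality))[quality]

print(codes)
print(ratings)
```

Calling `as.numeric()` directly on the factor would silently replace each rating with its level position, so the indexing form, or `as.numeric(as.character(quality))`, is the safer choice.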
The dataset has some limitations. It does not include variables such as production date, location or brand. In the future, we could add features like these to improve the predictions.